# Document Image Understanding
## Qwen2.5 VL 72B Instruct FP8 Dynamic
parasail-ai · Apache-2.0 · Image-to-Text · Transformers · English

FP8 quantized version of Qwen2.5-VL-72B-Instruct, supporting vision-text input and text output, optimized and released by Neural Magic.
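Instruct-tuned VLMs like the Qwen2.5-VL variants in this list take interleaved vision-text input. As a minimal sketch, this is the chat-message layout such models commonly expect before the processor's chat template is applied; the file name and question are hypothetical placeholders, not values from this listing:

```python
# Sketch of the interleaved vision-text chat format used by
# Qwen2.5-VL-style instruct models. The image path and question
# below are illustrative placeholders.

def build_vision_chat(image_ref: str, question: str) -> list[dict]:
    """Compose a single-turn conversation whose user content mixes
    an image item and a text item, in that order."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_ref},
                {"type": "text", "text": question},
            ],
        }
    ]

messages = build_vision_chat("invoice_page.png", "Extract the invoice total.")
```

A processor's `apply_chat_template` would then turn this structure into model-ready token and pixel inputs.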
## Olmocr 7B 0225 Preview
FriendliAI · Apache-2.0 · Text Recognition · Transformers · English

A document OCR model fine-tuned from Qwen2-VL-7B-Instruct, supporting multilingual document recognition and metadata extraction.
## Qwen2.5 VL 3B Instruct Quantized.w4a16
RedHatAI · Apache-2.0 · Image-to-Text · Transformers · English

Quantized version of Qwen2.5-VL-3B-Instruct with weights quantized to INT4 while activations remain in FP16 (w4a16), designed for efficient vision-text inference.
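As a rough illustration of the w4a16 idea named above — weights stored as 4-bit integers with a scale, activations and the dequantized weights kept in higher-precision float — here is a toy symmetric round-trip in pure Python. The per-tensor scale, the [-8, 7] clamp, and the example values are illustrative assumptions, not the model's actual quantization recipe:

```python
# Toy sketch of symmetric INT4 weight quantization (the "w4" half of
# w4a16). Real schemes typically use per-group or per-channel scales;
# this uses a single per-tensor scale for clarity.

def quantize_int4(weights: list[float]) -> tuple[list[int], float]:
    """Map float weights onto the signed 4-bit range [-8, 7]."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 7.0                              # one step in float units
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights for a float-activation matmul."""
    return [v * scale for v in q]

w = [0.42, -1.3, 0.07, 0.9]
q, s = quantize_int4(w)
w_hat = dequantize(q, s)
# Round-to-nearest keeps each weight within half a quantization step.
```

The point of a16 is that only the weights pay the precision cost; activations flow through the matmul in float, so accuracy loss is bounded by the weight rounding error.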
## Qwen2.5 VL 72B Instruct FP8 Dynamic
RedHatAI · Apache-2.0 · Image-to-Text · Transformers · English

The FP8 quantized version of Qwen2.5-VL-72B-Instruct, supporting vision-text input and text output, suitable for multimodal tasks.
## Eagle2 9B
KnutJaegersberg · Image-to-Text · Transformers · Other

Eagle2 is a high-performance series of vision-language models focused on improving performance through optimized data strategies and training methods. Eagle2-9B is the largest model in the series, striking a good balance between performance and inference speed.
## Eagle2 1B
nvidia · Image-to-Text · Transformers · Other

Eagle2 is a high-performance vision-language model family that emphasizes transparency in data strategies and training recipes, aiming to help the open-source community build competitive vision-language models.
## Paligemma2 10b Ft Docci 448
google · Image-to-Text · Transformers

PaliGemma 2 is a versatile vision-language model (VLM) from Google that combines image and text processing and supports multilingual, multi-task use.
## Florence 2 DocVQA
impactframes · Image-to-Text · Transformers

A version of Microsoft's Florence-2 model fine-tuned for one day on the Docmatix dataset (5% of the data), suitable for image-text understanding tasks.
## Paligemma 3b Ft Docvqa 896
google · Image-to-Text · Transformers

PaliGemma is a lightweight vision-language model from Google, built on the SigLIP vision encoder and the Gemma language model, supporting multilingual image-text understanding and generation.
## Uae License Detection
codedrainer · MIT · Image-to-Text · Transformers

Donut is an OCR-free document understanding Transformer that combines a vision encoder with a text decoder to process document images.
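Donut-based models like the ones below return structured fields as XML-like tags in the decoded sequence rather than as free text. The sketch below, loosely modeled on Donut's token2json post-processing, shows the idea for flat fields only; the tag names and values are made up, and real Donut sequences also support nesting, which is omitted here:

```python
import re

# Simplified sketch of Donut-style output parsing: the decoder emits
# field tags such as <s_total>12.00</s_total>, and a post-processing
# step turns matching open/close pairs into a dict.

TAG = re.compile(r"<s_(\w+)>(.*?)</s_\1>", re.DOTALL)

def parse_donut_output(sequence: str) -> dict:
    """Extract <s_field>value</s_field> pairs into a flat dict."""
    return {name: value.strip() for name, value in TAG.findall(sequence)}

decoded = (
    "<s_company>ACME LTD</s_company>"
    "<s_date>2021-03-04</s_date>"
    "<s_total>12.00</s_total>"
)
fields = parse_donut_output(decoded)
# fields == {"company": "ACME LTD", "date": "2021-03-04", "total": "12.00"}
```

The backreference `\1` in the pattern ensures a value is only captured when its opening and closing tag names agree, so mismatched tags are simply skipped.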
## Donut Base Medical Handwritten Prescriptions Information Extraction Final
Javeria98 · MIT · Image-to-Text · Transformers

A medical handwritten prescription information extraction model based on the Donut architecture, designed to extract structured information from prescription images.
## Thesisdonut
Humayoun · MIT · Image-to-Text · Transformers

A model fine-tuned from naver-clova-ix/donut-base; its specific purpose and capabilities are not documented.
## Donut Base Sroie
enoreyes · MIT · Text Recognition · Transformers

A document understanding model fine-tuned from naver-clova-ix/donut-base, specialized in structured document information extraction.
## Donut Base Bol
prakriti42 · MIT · Text Recognition · Transformers

A document understanding model fine-tuned from naver-clova-ix/donut-base on an image-folder dataset.
## Donut Base Sroie
zahra000 · MIT · Text Recognition · Transformers

A model fine-tuned from naver-clova-ix/donut-base on an image-folder dataset, suitable for document understanding tasks.
## Donut Base Sroie Fine Tuned
adrianccy · MIT · Text Recognition · Transformers

A fine-tuned version of naver-clova-ix/donut-base trained on an image-folder dataset, suitable for document understanding tasks.